**Team Members**: Christian Peralta, Camille Beatrice Valera, Sebastian Urs Wey
For the Machine Learning Kaggle Competition 2022, we were given a dataset derived from the United States Census Bureau (USCB), https://www.census.gov/. The USCB conducts various yearly surveys as well as the decennial census, which produce data about the U.S. population and its economy. These data enable federal and local governments to make informed decisions on the allocation of federal funds, international trade, health, housing, and other factors influencing the standard of living.
Based on the demographic predictors in the USCB dataset, our main objective is to model and predict the income level of U.S. residents who participated in the 2020-2021 survey. In the subsequent sections, we outline the specific approaches and thought process undertaken for the following:
Supplementary notebooks have also been included to demonstrate other methods tested and further justify the final model selected for submission.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import timeit
from sklearn.preprocessing import OrdinalEncoder
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, StackingClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA
from sklearn.model_selection import ParameterGrid, StratifiedKFold, GridSearchCV, cross_val_score, cross_val_predict
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.metrics import accuracy_score
from yellowbrick.classifier import ROCAUC, confusion_matrix
# Importing the datasets
test_df = pd.read_csv('test.csv')
train_df = pd.read_csv('train.csv')
In the training data set (train_df), we have 41 columns, including the response variable high_income. There are 54607 entries in total, with many features containing NAs. The fraction of NAs differs across features and exceeds 50% for some variables (e.g. mig_prev_sunbelt).
train_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54607 entries, 0 to 54606
Data columns (total 41 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   id                54607 non-null  int64
 1   occ_code_level2   54607 non-null  int64
 2   age               29892 non-null  float64
 3   stock_dividends   29720 non-null  float64
 4   mig_chg_msa       26888 non-null  object
 5   tax_filer_stat    29121 non-null  object
 6   det_hh_summ       54607 non-null  object
 7   mig_prev_sunbelt  26888 non-null  object
 8   hisp_origin       54377 non-null  object
 9   education         29786 non-null  object
 10  wage_per_hour     29712 non-null  float64
 11  capital_losses    29850 non-null  float64
 12  vet_question      54607 non-null  object
 13  own_or_self       54607 non-null  int64
 14  country_self      53532 non-null  object
 15  mig_move_reg      26888 non-null  object
 16  high_income       54607 non-null  int64
 17  hs_college        29771 non-null  object
 18  class_worker      54607 non-null  object
 19  mig_same          54607 non-null  object
 20  unemp_reason      29642 non-null  object
 21  state_prev_res    54423 non-null  object
 22  ind_code_level2   54607 non-null  int64
 23  race              29759 non-null  object
 24  country_mother    52818 non-null  object
 25  capital_gains     29716 non-null  float64
 26  sex               30073 non-null  object
 27  ind_code_level1   54607 non-null  object
 28  citizenship       54607 non-null  object
 29  union_member      54607 non-null  object
 30  fam_under_18      54607 non-null  object
 31  marital_stat      29126 non-null  object
 32  region_prev_res   54607 non-null  object
 33  mig_chg_reg       26888 non-null  object
 34  country_father    52581 non-null  object
 35  occ_code_level1   54607 non-null  object
 36  full_or_part_emp  54607 non-null  object
 37  weeks_worked      30410 non-null  float64
 38  det_hh_fam_stat   54607 non-null  object
 39  num_emp           54607 non-null  int64
 40  vet_benefits      54607 non-null  int64
dtypes: float64(6), int64(7), object(28)
memory usage: 17.1+ MB
We further observe that the first feature, id, is redundant, as we assume the data is already well shuffled. In addition to NaN (which represents a true NA), we often observe an entry called Not in universe. This entry should not be confused with NA, since it indicates that the question does not apply to the person completing the questionnaire. Another notable observation is that some categorical variables contain a high number of distinct categories, e.g. education or country_father.
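To make the distinction concrete, the following sketch counts true NAs and Not in universe entries separately. It runs on a small hypothetical two-column frame, not the actual survey data:

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the survey: NaN marks a truly missing answer, while
# "Not in universe" means the question did not apply to the respondent.
toy = pd.DataFrame({
    "union_member": ["Yes", "Not in universe", np.nan, "No"],
    "vet_question": ["No", np.nan, "Not in universe", "Not in universe"],
})

# Count the two kinds of "missing" separately, per column
na_counts = toy.isna().sum()
niu_counts = (toy == "Not in universe").sum()
print(na_counts["union_member"], niu_counts["vet_question"])  # 1 2
```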
train_df.head(10)
| id | occ_code_level2 | age | stock_dividends | mig_chg_msa | tax_filer_stat | det_hh_summ | mig_prev_sunbelt | hisp_origin | education | ... | marital_stat | region_prev_res | mig_chg_reg | country_father | occ_code_level1 | full_or_part_emp | weeks_worked | det_hh_fam_stat | num_emp | vet_benefits | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 42.0 | 0.0 | NaN | Nonfiler | Householder | NaN | All other | 11th grade | ... | NaN | Not in universe | NaN | United-States | Not in universe | Not in labor force | 0.0 | Householder | 0 | 2 |
| 1 | 2 | 18 | 56.0 | NaN | NaN | NaN | Householder | NaN | All other | High school graduate | ... | Married-civilian spouse present | Not in universe | NaN | United-States | Sales | Full-time schedules | NaN | Householder | 1 | 2 |
| 2 | 3 | 26 | 26.0 | NaN | NaN | Joint both under 65 | Householder | NaN | All other | High school graduate | ... | NaN | Not in universe | NaN | Haiti | Adm support including clerical | Full-time schedules | NaN | Householder | 3 | 2 |
| 3 | 4 | 0 | 67.0 | NaN | MSA to MSA | Joint one under 65 & one 65+ | Householder | No | All other | NaN | ... | NaN | Northeast | Same county | United-States | Not in universe | Children or Armed Forces | 0.0 | Householder | 0 | 1 |
| 4 | 5 | 0 | NaN | NaN | Nonmover | Nonfiler | Child under 18 never married | Not in universe | All other | Children | ... | NaN | Not in universe | Nonmover | United-States | Not in universe | Children or Armed Forces | NaN | Child <18 never marr not in subfamily | 0 | 0 |
| 5 | 6 | 36 | NaN | NaN | NonMSA to nonMSA | NaN | Householder | No | All other | NaN | ... | NaN | South | Same county | United-States | Machine operators assmblrs & inspctrs | Children or Armed Forces | NaN | Nonfamily householder | 6 | 2 |
| 6 | 7 | 26 | 40.0 | NaN | Nonmover | Single | Householder | Not in universe | All other | High school graduate | ... | Divorced | Not in universe | Nonmover | United-States | Adm support including clerical | Children or Armed Forces | 52.0 | Householder | 6 | 2 |
| 7 | 8 | 11 | 39.0 | 250.0 | Nonmover | Single | Householder | Not in universe | All other | NaN | ... | NaN | Not in universe | Nonmover | United-States | Professional specialty | Children or Armed Forces | 52.0 | Nonfamily householder | 6 | 2 |
| 8 | 9 | 0 | 51.0 | 0.0 | Nonmover | Nonfiler | Householder | Not in universe | Other Spanish | 1st 2nd 3rd or 4th grade | ... | NaN | Not in universe | Nonmover | Puerto-Rico | Not in universe | Children or Armed Forces | 0.0 | Householder | 0 | 2 |
| 9 | 10 | 0 | 44.0 | 0.0 | Nonmover | Head of household | Householder | Not in universe | All other | Some college but no degree | ... | Divorced | Not in universe | Nonmover | United-States | Not in universe | Children or Armed Forces | 26.0 | Householder | 2 | 2 |
10 rows × 41 columns
Overlapping Variables: The following examples show that some variables clearly overlap. For example, ind_code_level1 and ind_code_level2 contain the exact same number of answers for some categories.
train_df["ind_code_level1"].value_counts().head(7)
Not in universe or children          21436
Retail trade                          4363
Manufacturing-durable goods           3531
Education                             2725
Finance insurance and real estate     2535
Manufacturing-nondurable goods        2312
Other professional services           2088
Name: ind_code_level1, dtype: int64
train_df["ind_code_level2"].value_counts().head(7)
0     21436
33     4363
43     2725
45     2088
4      1884
42     1658
37     1411
Name: ind_code_level2, dtype: int64
The same is true for the variables det_hh_fam_stat and det_hh_summ, where det_hh_summ summarizes some categories of det_hh_fam_stat. For instance, the number of householders in det_hh_summ is 26295, which equals the number of householders (19830) plus the number of nonfamily householders (6465) in det_hh_fam_stat.
train_df["det_hh_fam_stat"].value_counts().head(7)
Householder                                 19830
Spouse of householder                       10981
Child <18 never marr not in subfamily       10280
Nonfamily householder                        6465
Child 18+ never marr Not in a subfamily      2526
Secondary individual                         1578
Other Rel 18+ ever marr not in subfamily      445
Name: det_hh_fam_stat, dtype: int64
train_df["det_hh_summ"].value_counts().head(7)
Householder                             26295
Spouse of householder                   10982
Child under 18 never married            10299
Child 18 or older                        3043
Other relative of householder            2055
Nonrelative of householder               1893
Group Quarters- Secondary individual       29
Name: det_hh_summ, dtype: int64
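A quick arithmetic check, using the counts printed in the two outputs above, confirms the summarization claim:

```python
# The two householder sub-categories of det_hh_fam_stat should add up to
# the single "Householder" category of det_hh_summ.
householder = 19830   # 'Householder' in det_hh_fam_stat
nonfamily = 6465      # 'Nonfamily householder' in det_hh_fam_stat
summarized = 26295    # 'Householder' in det_hh_summ
print(householder + nonfamily == summarized)  # True
```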
The variable hs_college can also be considered redundant, as it contains only a little information, all of which is already included in the variable education.
train_df["hs_college"].value_counts()
Not in universe          28231
High school               1015
College or university      525
Name: hs_college, dtype: int64
Skewed variables: We also discovered that a few variables are heavily skewed:
stock_dividends, wage_per_hour, capital_losses, capital_gains
# Capital gains
sns.displot(data=train_df['capital_gains'], kind="hist", kde=True).set(title='Distribution of capital gains');
# Capital losses
sns.displot(data=train_df['capital_losses'], kind="hist", kde=True).set(title="Distribution of capital losses");
# Stock dividends
sns.displot(data=train_df['stock_dividends'], kind="hist", kde=True).set(title="Distribution of stock_dividends");
# Wage per hour
sns.displot(data=train_df['wage_per_hour'], kind="hist", kde=True).set(title="Distribution of wage_per_hour");
Once exploratory data analysis was completed, our team initially pursued different strategies in order to surpass the first baseline. One team member obtained a Kaggle score of 86% using an optimized random forest classifier and SimpleImputer (strategy=mean) prior to any data cleaning, including redundant variables such as id. While this approach generated a high score, the resulting training data set was immense and computationally demanding due to the high number of distinct categories converted into dummy variables. Alternatively, another team member applied an approach using cleaned data; this ultimately resulted in lower scores on the public leaderboard, which made it difficult for us to agree on which variables should be removed or summarized.
We finally implemented the following:
- id: Given the description of the dataset, we theorize that the dataframe is well shuffled and that the variable id should not contain any information for the prediction.
- state_prev_res is a more detailed version of region_prev_res; we decided to keep the latter.
- ind_code_level2 and ind_code_level1 contain the same information. However, ind_code_level2 is encoded and seems to be more detailed than ind_code_level1, as the value counts shown above illustrate. Therefore, we decided to keep the summarized variable ind_code_level1.
- det_hh_summ is a summarized version of det_hh_fam_stat. We decided to drop det_hh_fam_stat and keep det_hh_summ.
- hs_college was dropped because it presumably contains the same information as the variable education.
- For country_self, country_mother and country_father, we summarized entries as US or Non-US.
- education was encoded with OrdinalEncoder, given that its categories can be ordered in a logical way. The other qualitative variables were transformed into dummy variables.

These decisions are implemented in the function DataCleaning(), which we then applied to both the test and the train data set.
# create a function which can be applied to both dataframes
def DataCleaning(df):
    # remove redundant variables
    df.drop(columns=['id', 'state_prev_res', 'ind_code_level2', 'det_hh_fam_stat', 'hs_college'], inplace=True)
    # summarize country categories as US vs. Non-US
    df["country_self"].replace("(?m)^(?!United-States).*", "Non-US", regex=True, inplace=True)
    df["country_father"].replace("(?m)^(?!United-States).*", "Non-US", regex=True, inplace=True)
    df["country_mother"].replace("(?m)^(?!United-States).*", "Non-US", regex=True, inplace=True)
    # encode the ordered variable education (note: categories expects one list per feature)
    ordenc = OrdinalEncoder(categories=[['Children', 'Less than 1st grade', '1st 2nd 3rd or 4th grade', '5th or 6th grade', '7th and 8th grade', '9th grade',
                                         '10th grade', '11th grade', '12th grade no diploma', 'High school graduate', 'Some college but no degree',
                                         'Associates degree-occup /vocational', 'Associates degree-academic program', 'Bachelors degree(BA AB BS)',
                                         'Masters degree(MA MS MEng MEd MSW MBA)', 'Doctorate degree(PhD EdD)', 'Prof school degree (MD DDS DVM LLB JD)']],
                            handle_unknown="use_encoded_value", unknown_value=np.nan)
    df["education"] = ordenc.fit_transform(df.loc[:, ['education']])
    # return the cleaned dataframe
    return df
# Apply DataCleaning() on both datasets
train_df = DataCleaning(train_df)
test_df = DataCleaning(test_df)
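As a sanity check on the country summarization inside DataCleaning(), the same regex can be tried on a small hand-made series (toy values, not the actual columns): every entry other than "United-States" should collapse to "Non-US".

```python
import pandas as pd

# Toy series with a few country values seen in the data
s = pd.Series(["United-States", "Haiti", "Puerto-Rico", "United-States"])

# Same pattern as in DataCleaning(): negative lookahead keeps "United-States"
s = s.replace("(?m)^(?!United-States).*", "Non-US", regex=True)
print(sorted(s.unique()))  # ['Non-US', 'United-States']
```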
Additional Data Preparation: In order to make the data accessible to different models, further modifications are needed. They can be summarized as follows:
# 1. Add column for later separation. Append test_df to train_df.
train_df['train_test'] = 'train'
test_df['train_test'] = 'test'
df_all = pd.concat([train_df, test_df])  # DataFrame.append is deprecated; use pd.concat
# 1.1. Transform Categorical Variables
cols = list(df_all.select_dtypes(include=['object']).columns)
df_all[cols] = df_all[cols].astype('category')
# 2.1. NA-Imputation on categorical data (most_frequent)
variables_cat = list(df_all.select_dtypes(include=['category']).columns)
imp = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
df_cat = pd.DataFrame(imp.fit_transform(df_all[variables_cat]))
df_cat.columns=df_all[variables_cat].columns
# 2.2 NA-Imputation on numeric variables (mean for non-skewed, median for skewed)
variables_numeric = list(set(df_all.columns) - set(variables_cat))
variables_skewed = ["stock_dividends", "wage_per_hour", "capital_losses", "capital_gains"]
variables_nonskewed = list(set(variables_numeric) - set(variables_skewed))
# Imputation : non- skewed variables
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
df_nonskewed = pd.DataFrame(imp.fit_transform(df_all[variables_nonskewed]))
df_nonskewed.columns=df_all[variables_nonskewed].columns
# Imputation: skewed variables
imp = SimpleImputer(missing_values=np.nan, strategy='median')
df_skewed = pd.DataFrame(imp.fit_transform(df_all[variables_skewed]))
df_skewed.columns=df_all[variables_skewed].columns
# 2.3 Concat dataframes
df_all = pd.concat([df_skewed, df_nonskewed, df_cat], axis = 1)
# 3. Create Dummy Variables: apply get_dummies()
df_all = pd.get_dummies(df_all, drop_first=True)
# 4. Separate dataframes again
# train_df
train_df = df_all.loc[df_all.train_test_train==1]
train_df = train_df.drop("train_test_train", axis=1)
# Kaggle test dataset
X_kaggle = df_all.loc[df_all.train_test_train==0]
X_kaggle = X_kaggle.drop("train_test_train", axis=1)
X_kaggle = X_kaggle.drop("high_income", axis=1)
# 5. Create a copy of train_df for later use in final model fitting
train_df_original = train_df.copy()
# 6. Split train_df (for model comparison)
X_comp = train_df.drop("high_income", axis=1)
y_comp = train_df["high_income"]
X_train, X_test, y_train, y_test = train_test_split(X_comp, y_comp, test_size=0.1, shuffle=True, random_state=12)
# 6.1 Create copies of X_train, X_test, y_train and y_test
X_train_original = X_train.copy()
X_test_original = X_test.copy()
y_train_original = y_train.copy()
y_test_original = y_test.copy()
The primary objective of the problem is to predict whether an individual has high or low income given the data set; hence, it is a classification problem. To address this challenge, our team decided to start with the simplest models available and later move on to more complex models.
In all cases, each model's performance was tested with and without hyperparameter tuning. Models that achieved at least 80% accuracy on the test set were retained for stacking. All of the best-performing models were then further optimized and evaluated to select the final model. Models that fell below our 80% baseline are described in the supplementary notebooks.
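As a preview of the stacking step, here is a minimal sketch, on synthetic data rather than the census set, of how retained base models can be combined with scikit-learn's StackingClassifier; the actual estimators and their tuned parameters are developed in the sections that follow.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification problem as a stand-in for the census data
X, y = make_classification(n_samples=500, random_state=12)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=12)

# Base models that cleared the accuracy bar go into `estimators`;
# a logistic regression meta-learner combines their CV predictions.
stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=50, random_state=12))],
    final_estimator=LogisticRegression(),
    cv=5,
)
stack.fit(X_tr, y_tr)
print(stack.score(X_te, y_te))
```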
In this subsection, a random forest classifier is fitted to our data. The initial accuracy on the separated test data set was 85.6%. This initial score was improved by parameter tuning (values printed at the end of the code chunks). To ensure that the number of trees (n_estimators) is high enough, we cross-validated two different numbers of trees (theoretically, choosing a value that is too high cannot cause overfitting). We initially narrowed down the values in the parameter grid using the out-of-bag score, which is computationally faster. However, it is not directly comparable to the cross-validation score used for the other models, which is why we finally tuned the parameters using cross-validation.
A random forest on such a big dataset is computationally very demanding. By setting max_samples = 0.1, only 10% of the data is used to build each tree, which accelerates the process. Another approach could have been to further summarize the categorical variables so that fewer dummy variables are created.
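The out-of-bag shortcut mentioned above can be sketched as follows (again on synthetic stand-in data): with oob_score=True, each tree is evaluated on the samples left out of its bootstrap, yielding a generalization estimate without a separate cross-validation loop.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data as a stand-in for the prepared census features
X, y = make_classification(n_samples=500, random_state=12)

# oob_score=True scores each tree on its out-of-bag samples during fit
rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=12)
rf.fit(X, y)
print(rf.oob_score_)  # accuracy estimate, no extra CV loop needed
```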
The final obtained best parameters are:
- n_estimators = 250
- max_features = 30
- min_samples_leaf = 1
- max_depth = None

# Restore training and test data set
X_train = X_train_original.copy()
X_test = X_test_original.copy()
y_train = y_train_original.copy()
y_test = y_test_original.copy()
# Model without Optimization
rfc = RandomForestClassifier(random_state=12)
rfc.fit(X_train, y_train)
print(f"Accuracy score with default parameters : {accuracy_score(y_true=y_test, y_pred=rfc.predict(X_test))}")
Accuracy score with default parameters : 0.854971616919978
# Optimization
start_time = timeit.default_timer()
# Number of trees in random forest
n_estimators = [100, 250]
# Number of features to consider at every split
max_features = ['sqrt', 16, 18, 20, 22, 24, 26, 28, 30]
# number of samples required at each leaf node
min_samples_leaf = [1, 3, 5]
# max_depth
max_depth = [20, 35, None]
# Create the random grid
param_grid = {'n_estimators': n_estimators,
'max_features': max_features,
'min_samples_leaf': min_samples_leaf,
'max_depth': max_depth}
# Create the Classifier. For computational reasons, use max_samples=0.1
rfc = RandomForestClassifier(random_state=12, max_samples=0.1)
# Instantiate the grid search model
folds=StratifiedKFold(n_splits=5, shuffle=True, random_state=12)
grid_cv=GridSearchCV(estimator=rfc, param_grid=param_grid, scoring="accuracy", cv=folds, n_jobs=-1)
# Fit the grid search to the data
grid_cv.fit(X_train, y_train)
best_param = grid_cv.best_params_
best_score = grid_cv.best_score_
print("Best parameters:", best_param)
print("Best score:", best_score)
elapsed = timeit.default_timer() - start_time
elapsed_min = elapsed/60
print("Time [min]:", elapsed_min)
# Kaggle Score: 0.86098
Best parameters: {'max_depth': None, 'max_features': 30, 'min_samples_leaf': 1, 'n_estimators': 250}
Best score: 0.8605380097324472
Time [min]: 43.14529777925
# Save mean CV score for model evaluation
rfc_meanCV = grid_cv.best_score_
# Save standard error of the CV score for model evaluation
rfc_stdCV = grid_cv.cv_results_['std_test_score'][grid_cv.best_index_]/np.sqrt(5)
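The standard-error computation used here, the SD of the per-fold scores divided by the square root of the number of folds, can be illustrated with hypothetical fold scores:

```python
import numpy as np

# Hypothetical 5-fold CV accuracies (for illustration only)
fold_scores = np.array([0.86, 0.85, 0.87, 0.86, 0.855])

# Standard error of the mean CV score: SD of fold scores / sqrt(k)
se = fold_scores.std() / np.sqrt(len(fold_scores))
print(se)
```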
# Model with optimization
rfc_final = grid_cv.best_estimator_
rfc_final.fit(X_train, y_train)
rfc_test_score = accuracy_score(y_true=y_test, y_pred=rfc_final.predict(X_test))
print("Accuracy score on the separated test data set:", rfc_test_score)
Accuracy score on the separated test data set: 0.8699871818348288
Similar to the random forest model, we first created a gradient boosting classifier model utilizing its default settings in order to have a baseline reference for the optimization steps.
# Recall training and test data set
X_train = X_train_original.copy()
X_test = X_test_original.copy()
y_train = y_train_original.copy()
y_test = y_test_original.copy()
# Baseline Model (without Optimization)
GBC = GradientBoostingClassifier(random_state=12)
GBC.fit(X_train,y_train)
print(f"Accuracy score with default parameters: {accuracy_score(y_true=y_test, y_pred=GBC.predict(X_test))}")
#Compare to KAGGLE Import Score = 0.85109 (without optimization)
Optimization was conducted step-wise, based on each parameter's importance, i.e. its impact on the overall model.
For reference, the default values of the parameters being optimized are: learning_rate = 0.1, n_estimators = 100, max_depth = 3, min_samples_split = 2, min_samples_leaf = 1, and max_features = None.
# OPTIMIZATION STEP 1 (LEARNING RATE and NUMBER OF ESTIMATORS)
#Creating the GridSearchCV parameters
parameters = {
"learning_rate": [0.01, 0.1, 0.2, 0.3],
"n_estimators":[325,330,350,400]
}
gb = GradientBoostingClassifier(random_state=12, subsample=0.1)
#Passing the scoring function in the GridSearchCV
folds = StratifiedKFold(n_splits=5, random_state=12, shuffle=True)
GBC1 = GridSearchCV(estimator=gb, param_grid=parameters, scoring='accuracy', cv=folds, n_jobs=-1)
#Fitting the Model
GBC1.fit(X_train, y_train)
best_param = GBC1.best_params_
best_score = GBC1.best_score_
print("Best parameters:", best_param)
print("Best score:", best_score)
Best parameters: {'learning_rate': 0.1, 'n_estimators': 325}
Best score: 0.8629593805860478
# Model with Optimization STEP 1
GBC_fit1= GradientBoostingClassifier(learning_rate=0.1, n_estimators=330, random_state=12)
GBC_fit1.fit(X_train, y_train)
GBC_accuracy_fit1 = accuracy_score(y_true=y_test, y_pred=GBC_fit1.predict(X_test))
print(GBC_accuracy_fit1)
#Score with learning_rate: 0.1 with n_estimators: 250 = 0.8727
#Score with learning_rate: 0.1 with n_estimators: 300 = 0.8722
#Score with learning_rate: 0.1 with n_estimators: 310 = 0.87328
#Score with learning_rate: 0.1 with n_estimators: 320 = 0.87369
#Score with learning_rate: 0.1 with n_estimators: 325 = 0.874199
#Score with learning_rate: 0.1 with n_estimators: 330 = 0.875114
0.8751144479033144
# OPTIMIZATION STEP 2 (MAX DEPTH)
#Creating the GridSearchCV parameters
parameters_2 = {
"max_depth":[2,3,4,5]
}
gb1 = GradientBoostingClassifier(learning_rate=0.1, n_estimators=330, random_state=12, subsample=0.1)
#Passing the scoring function in the GridSearchCV
folds = StratifiedKFold(n_splits=5, random_state=12, shuffle=True)
GBC2 = GridSearchCV(estimator=gb1, param_grid=parameters_2, scoring='accuracy', cv=folds, n_jobs=-1)
#Fitting the Model
GBC2.fit(X_train, y_train)
best_param = GBC2.best_params_
best_score = GBC2.best_score_
print("Best parameters:", best_param)
print("Best score:", best_score)
Best parameters: {'max_depth': 3}
Best score: 0.8624710297873909
# Model with Optimization Step 2
GBC_fit2 = GradientBoostingClassifier(learning_rate=0.1, n_estimators=330, max_depth=4, random_state=12)
GBC_fit2.fit(X_train, y_train)
GBC_accuracy_fit2 = accuracy_score(y_true=y_test, y_pred=GBC_fit2.predict(X_test))
print(GBC_accuracy_fit2)
#Score with learning_rate: 0.1 with n_estimators: 325, max_depth:2 = 0.868522
#Score with learning_rate: 0.1 with n_estimators: 325, max_depth:3 = 0.874199
#Score with learning_rate: 0.1 with n_estimators: 330, max_depth:3 = 0.875114
0.8743819813221022
# OPTIMIZATION STEP 3 (MIN SAMPLES SPLIT)
#Creating the GridSearchCV parameters
parameters_3 = {
"min_samples_split":[2,3,4,5]
}
gb2 = GradientBoostingClassifier(learning_rate=0.1, n_estimators=330, max_depth=3, random_state=12, subsample=0.1)
#Passing the scoring function in the GridSearchCV
folds = StratifiedKFold(n_splits=5, random_state=12, shuffle=True)
GBC3 = GridSearchCV(estimator=gb2, param_grid=parameters_3, scoring='accuracy', cv=folds, n_jobs=-1)
#Fitting the Model
GBC3.fit(X_train, y_train)
best_param = GBC3.best_params_
best_score = GBC3.best_score_
print("Best parameters:", best_param)
print("Best score:", best_score)
Best parameters: {'min_samples_split': 3}
Best score: 0.8626745217067396
# Model with Optimization Step 3
GBC_fit3 = GradientBoostingClassifier(learning_rate=0.1, n_estimators=330, max_depth=3, min_samples_split=2, random_state=12)
GBC_fit3.fit(X_train, y_train)
GBC_accuracy_fit3 = accuracy_score(y_true=y_test, y_pred=GBC_fit3.predict(X_test))
print(GBC_accuracy_fit3)
#Score with learning_rate: 0.1 with n_estimators: 325, max_depth:3, min_samples_split: 2 = 0.8733
#Score with learning_rate: 0.1 with n_estimators: 330, max_depth:3, min_samples_split: 2 = 0.875114
0.8751144479033144
# OPTIMIZATION STEP 4 (MIN SAMPLES LEAF)
#Creating the GridSearchCV parameters
parameters_4 = {
"min_samples_leaf":[1,3,5,7]
}
gb3 = GradientBoostingClassifier(learning_rate=0.1, n_estimators=330, max_depth=3, min_samples_split=2, random_state=12, subsample=0.1)
#Passing the scoring function in the GridSearchCV
folds = StratifiedKFold(n_splits=5, random_state=12, shuffle=True)
GBC4 = GridSearchCV(estimator=gb3, param_grid=parameters_4, scoring='accuracy', cv=folds, n_jobs=-1)
#Fitting the Model
GBC4.fit(X_train, y_train)
best_param = GBC4.best_params_
best_score = GBC4.best_score_
print("Best parameters:", best_param)
print("Best score:", best_score)
Best parameters: {'min_samples_leaf': 5}
Best score: 0.8659504774782038
# Model with Optimization Step 4
GBC_fit4 = GradientBoostingClassifier(learning_rate=0.1, n_estimators=330, max_depth=3, min_samples_split=2, min_samples_leaf=5, random_state=12)
GBC_fit4.fit(X_train, y_train)
GBC_accuracy_fit4 = accuracy_score(y_true=y_test, y_pred=GBC_fit4.predict(X_test))
print(GBC_accuracy_fit4)
#Score with learning_rate: 0.1 with n_estimators: 330, max_depth: 3, min_samples_split: 2, min_samples_leaf: 5 = 0.87218
#Score with learning_rate: 0.1 with n_estimators: 330, max_depth: 3, min_samples_split: 2, min_samples_leaf: 3 = 0.87347
#Score with learning_rate: 0.1 with n_estimators: 330, max_depth: 3, min_samples_split: 2, min_samples_leaf: 2 = 0.87438
#Score with learning_rate: 0.1 with n_estimators: 330, max_depth: 3, min_samples_split: 2, min_samples_leaf: 1 = 0.87511
0.8721845815784655
# OPTIMIZATION STEP 5 (MAX FEATURES)
#Creating the GridSearchCV parameters
parameters_5 = {
"max_features":['sqrt',20,50,None]
}
gb4 = GradientBoostingClassifier(learning_rate=0.1, n_estimators=330, max_depth=3, min_samples_split=2, min_samples_leaf=5, random_state=12, subsample=0.1)
#Passing the scoring function in the GridSearchCV
folds = StratifiedKFold(n_splits=5, random_state=12, shuffle=True)
GBC5 = GridSearchCV(estimator=gb4, param_grid=parameters_5, scoring='accuracy', cv=folds, n_jobs=-1)
#Fitting the Model
GBC5.fit(X_train, y_train)
best_param = GBC5.best_params_
best_score = GBC5.best_score_
print("Best parameters:", best_param)
print("Best score:", best_score)
Best parameters: {'max_features': None}
Best score: 0.8659504774782038
# Model with Optimization Step 5
GBC_fit5 = GradientBoostingClassifier(learning_rate=0.1, n_estimators=330, max_depth=3, min_samples_split=2, min_samples_leaf=5, max_features=None, random_state=12)
GBC_fit5.fit(X_train, y_train)
GBC_accuracy_fit5 = accuracy_score(y_true=y_test, y_pred=GBC_fit5.predict(X_test))
print(GBC_accuracy_fit5)
#Score with learning_rate: 0.1 with n_estimators: 330, max_depth: 3, min_samples_split: 2, min_samples_leaf: 5, max_features: None = 0.87218
#Score with learning_rate: 0.1 with n_estimators: 330, max_depth: 3, min_samples_split: 2, min_samples_leaf: 1, max_features: None = 0.87511
0.8721845815784655
As seen in the optimization above, certain steps demonstrated that the default values were the most appropriate for learning_rate, max_depth, min_samples_split, and max_features.
On the other hand, optimization for n_estimators and min_samples_leaf suggested 330 and 5 as the best values, respectively. Used in prediction, n_estimators = 330 resulted in a higher accuracy score, while min_samples_leaf = 5 yielded lower results than its default setting of 1 (see Model with Optimization Step 5 for the difference in scores).
Thus, the final gradient boosting classification model parameter values selected were:
- learning_rate = 0.1
- n_estimators = 330
- max_depth = 3
- min_samples_split = 2
- min_samples_leaf = 1
- max_features = None

# FINAL Model with Optimization Combined Steps
GBC_final = GradientBoostingClassifier(learning_rate=0.1, n_estimators=330, max_depth=3, min_samples_split=2, min_samples_leaf=1, max_features=None, random_state=12)
GBC_final.fit(X_train, y_train)
GBC_test_score = accuracy_score(y_true=y_test, y_pred=GBC_final.predict(X_test))
print("Accuracy score on the separated test data set:", GBC_test_score)
#GBC Final Model Accuracy = 0.8751144479033144
#KAGGLE Score = 0.85604 --> compared to UNOPTIMIZED Score = 0.85109
Accuracy score on the separated test data set: 0.8751144479033144
# Calculating mean CV score and standard error for model evaluation
scores = cross_val_score(GBC_final, X=X_test, y=y_test, cv=5)
GBC_mean_final, GBC_std_final = scores.mean(), scores.std()/np.sqrt(5)
An LDA classifier produces linear decision boundaries, known as hyperplanes: flat affine subspaces of dimension $p-1$. LDA fits a Gaussian density to each class, assuming that all classes share the same covariance matrix.
We attempted to apply regularization through the shrinkage parameter implemented in scikit-learn, where, according to the library, "shrinkage is a form of regularization used to improve the estimation of covariance matrices". However, the accuracy score was the same with and without regularization: 84.91%.
lda_pipe = Pipeline([("scaler", StandardScaler()), ("lda", LinearDiscriminantAnalysis(solver="lsqr"))])
lda_pipe.fit(X_train, y_train)
Pipeline(steps=[('scaler', StandardScaler()),
('lda', LinearDiscriminantAnalysis(solver='lsqr'))])
GridSearchCV was used to find the optimal value for the shrinkage hyper-parameter.
shrinkage = 10**np.linspace(-4,0, 100)
#use of StratifiedKFold
folds=StratifiedKFold(n_splits=5, shuffle=True, random_state=12)
grid_cv=GridSearchCV(estimator=lda_pipe, param_grid={"lda__shrinkage":shrinkage}, scoring="accuracy", cv=folds, n_jobs=-1)
search_cv = grid_cv.fit(X_train,y_train)
grid_cv.best_estimator_
Pipeline(steps=[('scaler', StandardScaler()),
('lda',
LinearDiscriminantAnalysis(shrinkage=0.0004430621457583882,
solver='lsqr'))])
print(f"LDA test accuracy with default parameters = {accuracy_score(y_true=y_test, y_pred=lda_pipe.predict(X_test))}")
LDA test accuracy with default parameters = 0.8491118842702802
print(f"LDA test accuracy with parameter tuning: {accuracy_score(y_test, grid_cv.best_estimator_.predict(X_test))}")
LDA test accuracy with parameter tuning: 0.8491118842702802
As shown above, LDA performs best with its default parameters; tuning the shrinkage does not beat this score.
For a fair comparison with other models, CV is performed on LDA with default parameters to obtain the mean and the standard deviation.
scores = cross_val_score(lda_pipe, X=X_test, y=y_test, cv=5)
LDA_mean_final, LDA_std_final = scores.mean(), scores.std()/np.sqrt(5)
print(LDA_mean_final, LDA_std_final)
0.8459978417434959 0.003905247462099063
Continuing with linear classifiers, logistic regression (logreg) was trained on our data. The logistic model assumes that $Y|X=x_{i}$ follows a Bernoulli distribution, in contrast to LDA, which assumes a Gaussian distribution for the inputs within each class.
To reduce the risk of overfitting and reduce model complexity, logreg was explored with both ridge and lasso regularization.
# ======== Plain LogReg ========
logReg = Pipeline([("scaler", StandardScaler()), ("log_reg", LogisticRegression(solver="saga", random_state=12, n_jobs=4, tol=1e-2))])
logReg.fit(X_train, y_train)
print(f'LogReg test accuracy with default parameters: {accuracy_score(y_test, logReg.predict(X_test))}')
LogReg test accuracy with default parameters: 0.8659586156381615
scores = cross_val_score(logReg, X=X_test, y=y_test, cv=5)
Log_mean_CV, Log_std_CV = scores.mean(), scores.std()/np.sqrt(5)
print(Log_mean_CV, Log_std_CV)
0.8507592438059044 0.0041327898644139185
Ridge regularization
# ======== Ridge regularization ========
log_reg_ridge = Pipeline([("scaler", StandardScaler()), ("log_reg", LogisticRegression(penalty="l2", solver="saga", C=1, random_state=12, n_jobs=4, tol=1e-2))])
log_reg_ridge.fit(X_train, y_train)
Pipeline(steps=[('scaler', StandardScaler()),
('log_reg',
LogisticRegression(C=1, n_jobs=4, random_state=12,
solver='saga', tol=0.01))])
C =10**np.linspace(-4,0, 100)
folds=StratifiedKFold(n_splits=5, shuffle=True, random_state=12)
grid_cv=GridSearchCV(estimator=log_reg_ridge, param_grid={"log_reg__C":C}, scoring="accuracy", cv=folds, n_jobs=4)
search_cv=grid_cv.fit(X_train,y_train)
grid_cv.best_estimator_
Pipeline(steps=[('scaler', StandardScaler()),
('log_reg',
LogisticRegression(C=0.07390722033525783, n_jobs=4,
random_state=12, solver='saga',
tol=0.01))])
print(f"LogReg Ridge test accuracy after hyperparameter tuning: {accuracy_score(y_test, grid_cv.best_estimator_.predict(X_test))}")
LogReg Ridge test accuracy after hyperparameter tuning: 0.8659586156381615
ridge_scores = cross_val_score(grid_cv.best_estimator_, X=X_test, y=y_test, cv=5)
LogRidge_mean_CV, LogRidge_std_CV = np.mean(ridge_scores), np.std(ridge_scores)/np.sqrt(5)
#print(LogRidge_mean_CV, LogRidge_std_CV)
Lasso regularization
# ======== Lasso regularization ========
log_reg_lasso = Pipeline([("scaler", StandardScaler()), ("log_reg", LogisticRegression(penalty="l1", solver="saga", C=1, random_state=12, n_jobs=4, tol=1e-2))])
log_reg_lasso.fit(X_train, y_train)
Pipeline(steps=[('scaler', StandardScaler()),
('log_reg',
LogisticRegression(C=1, n_jobs=4, penalty='l1',
random_state=12, solver='saga',
tol=0.01))])
grid_cv=GridSearchCV(estimator=log_reg_lasso, param_grid={"log_reg__C":C}, scoring="accuracy", cv=folds, n_jobs=4)
search_cv=grid_cv.fit(X_train,y_train)
print(f"LogReg Lasso test accuracy after hyperparameter tuning: {accuracy_score(y_test, grid_cv.best_estimator_.predict(X_test))}")
LogReg Lasso test accuracy after hyperparameter tuning: 0.8654092657022523
lasso_scores = cross_val_score(grid_cv.best_estimator_, X=X_test, y=y_test, cv=5)
LogLasso_mean_CV, LogLasso_std_CV = np.mean(lasso_scores), np.std(lasso_scores)/np.sqrt(5)
#print(LogLasso_mean_CV, LogLasso_std_CV)
Combining Principal Component Analysis and Logistic Regression
Principal Component Analysis (PCA) is an unsupervised method that summarizes the data set in a lower-dimensional predictor space spanned by the principal components. PCA is handy for detecting underlying trends in the data and, from a feature-engineering perspective, for extracting features from predictors that are highly correlated. The idea of combining logreg with PCA therefore seemed appealing.
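As a quick illustration of how PCA compresses correlated predictors, the sketch below uses synthetic data (not our survey features) to show how `explained_variance_ratio_` can guide the choice of `n_components`:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(12)
# Synthetic data: 3 latent factors expanded into 10 correlated predictors
latent = rng.normal(size=(500, 3))
X = latent @ rng.normal(size=(3, 10)) + 0.05 * rng.normal(size=(500, 10))

pca = PCA().fit(StandardScaler().fit_transform(X))
cum_var = np.cumsum(pca.explained_variance_ratio_)
# Smallest number of components capturing 95% of the variance
n_95 = int(np.searchsorted(cum_var, 0.95) + 1)
print(n_95)  # with only 3 latent factors, few components suffice
```

Passing a float such as `n_components=0.95` to `PCA` (as in the pipelines below) performs exactly this selection automatically.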
# ======== PCA + LogReg Ridge ========
pca_logReg_ridge = Pipeline([("scaler", StandardScaler()), ("pca",PCA(n_components=.95)) ,("log_reg", LogisticRegression(penalty="l2", solver="saga",
random_state=12, tol=1e-2))])
pca_logReg_ridge.fit(X_train, y_train)
Pipeline(steps=[('scaler', StandardScaler()), ('pca', PCA(n_components=0.95)),
('log_reg',
LogisticRegression(random_state=12, solver='saga', tol=0.01))])
C =10**np.linspace(-4,0, 100)
p_comp = [.90, .95, .99]
folds=StratifiedKFold(n_splits=5, shuffle=True, random_state=12)
grid_cv=GridSearchCV(estimator=pca_logReg_ridge, param_grid={"pca__n_components":p_comp ,"log_reg__C":C, }, scoring="accuracy", cv=folds, n_jobs=-1)
search_cv = grid_cv.fit(X_train, y_train)
grid_cv.best_estimator_
Pipeline(steps=[('scaler', StandardScaler()), ('pca', PCA(n_components=0.99)),
('log_reg',
LogisticRegression(C=0.20565123083486536, random_state=12,
solver='saga', tol=0.01))])
Comparing the accuracy on test set between default and tuned hyperparameters:
print(f"PCA + LogReg accuracy on test set with default parameters: {accuracy_score(y_test, pca_logReg_ridge.predict(X_test))}")
PCA + LogReg accuracy on test set with default parameters: 0.8529573338216444
print(f"PCA + LogReg Ridge accuracy on test set after hyperparameter tuning: {accuracy_score(y_test, grid_cv.best_estimator_.predict(X_test))}")
PCA + LogReg Ridge accuracy on test set after hyperparameter tuning: 0.8586339498260391
pca_ridge_scores = cross_val_score(grid_cv.best_estimator_, X=X_test, y=y_test, cv=5)
LogPCA_Ridge_mean_CV, LogPCA_Ridge_std_CV = np.mean(pca_ridge_scores), np.std(pca_ridge_scores)/np.sqrt(5)
#print(LogPCA_Ridge_mean_CV, LogPCA_Ridge_std_CV)
# ======== PCA + LogReg Lasso ========
pca_logReg_lasso = Pipeline([("scaler", StandardScaler()), ("pca",PCA(n_components=.95)) ,("log_reg", LogisticRegression(penalty="l1", solver="saga",
random_state=12, tol=1e-2))])
pca_logReg_lasso.fit(X_train, y_train)
Pipeline(steps=[('scaler', StandardScaler()), ('pca', PCA(n_components=0.95)),
('log_reg',
LogisticRegression(penalty='l1', random_state=12,
solver='saga', tol=0.01))])
grid_cv=GridSearchCV(estimator=pca_logReg_lasso, param_grid={"pca__n_components":p_comp ,"log_reg__C":C, }, scoring="accuracy", cv=folds)
search_cv =grid_cv.fit(X_train, y_train)
grid_cv.best_estimator_
Pipeline(steps=[('scaler', StandardScaler()), ('pca', PCA(n_components=0.99)),
('log_reg',
LogisticRegression(C=0.9111627561154896, penalty='l1',
random_state=12, solver='saga',
tol=0.01))])
Comparing the accuracy on test set between default and tuned hyperparameters:
print(f"PCA + LogReg Lasso test accuracy with default parameters: {accuracy_score(y_test, pca_logReg_lasso.predict(X_test))}")
PCA + LogReg Lasso test accuracy with default parameters: 0.8527742171763413
print(f"PCA + LogReg Lasso accuracy on test set after hyperparameter tuning: {accuracy_score(y_test, grid_cv.best_estimator_.predict(X_test))}")
PCA + LogReg Lasso accuracy on test set after hyperparameter tuning: 0.8586339498260391
pca_lasso_scores = cross_val_score(grid_cv.best_estimator_, X=X_test, y=y_test, cv=5)
LogPCA_Lasso_mean_CV, LogPCA_Lasso_std_CV = np.mean(pca_lasso_scores), np.std(pca_lasso_scores)/np.sqrt(5)
#print(LogPCA_Lasso_mean_CV, LogPCA_Lasso_std_CV)
The aim here is to provide an intuitive comparison of the performance of the logreg-based models built in this section.
Model hyperparameters were tuned using GridSearchCV. The best estimator was retrieved and cross-validation was performed on the test set for a fair comparison of the mean score and its standard error.
accuracy_test_df = pd.DataFrame({"Accuracy": [0.8507, 0.8492, 0.8443, 0.8441, 0.8447],
                                 "std": [0.0041, 0.0041, 0.0040, 0.0042, 0.0041]},
                                index=["LogReg", "LogRidge", "LogLasso", "PCA_LogRidge", "PCA_LogLasso"])
accuracy_test = px.scatter(data_frame=accuracy_test_df, x=accuracy_test_df.index, y="Accuracy",
                           symbol=accuracy_test_df.index, template="presentation", error_y="std",
                           color=accuracy_test_df.index, width=700, height=400,
                           labels={"index": "Estimators"}, title="Performance of different Estimators on Test Set")
accuracy_test.update_layout(font=dict(size=12), legend_title="Estimators")
On one hand, some models did not perform better after hyperparameter tuning; an extreme case is LogLasso.
On the other hand, some models improved after hyperparameter tuning, as was the case for both PCA-LogReg models. Nonetheless, their performance was still worse than plain LogReg.
As discussed during practical classes, ridge regularization aids prediction accuracy while lasso aids variable selection.
The decision to retain a model was based on the theoretical aims of improving predictive performance and reducing model complexity. As shown in the graph above, LogReg and LogRidge perform similarly in both score and imprecision, and their standard errors overlap; we therefore consider LogRidge the "best model" of this sub-section.
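The ridge-versus-lasso contrast mentioned above can be made concrete with a small sketch on synthetic data (not our survey features): an L1 penalty drives coefficients of uninformative features exactly to zero, while an L2 penalty only shrinks them.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic problem: only 5 of 20 features carry signal
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           n_redundant=0, random_state=12)

ridge = LogisticRegression(penalty="l2", C=0.1, solver="saga", max_iter=5000).fit(X, y)
lasso = LogisticRegression(penalty="l1", C=0.1, solver="saga", max_iter=5000).fit(X, y)

n_zero_ridge = int(np.sum(ridge.coef_ == 0))
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
print(n_zero_ridge, n_zero_lasso)  # lasso zeroes out many coefficients; ridge does not
```

This is why lasso is described as performing variable selection, whereas ridge mainly trades a little bias for lower variance.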
#### Computing mean cv score and std for the models of this section which will be included on stacking ####
# ==== LDA ====
lda_scores = cross_val_score(lda_pipe, X=X_test, y=y_test, cv=5)
LDA_mean_final, LDA_std_final = np.mean(lda_scores), np.std(lda_scores)/np.sqrt(5)
# ==== Logistic ====
log_ridge_final = Pipeline([("scaler", StandardScaler()), ("log_reg", LogisticRegression(penalty="l2", solver="saga", C=0.07390722033525783, random_state=12, tol=1e-2))])
log_ridge_scores = cross_val_score(log_ridge_final, X=X_test, y=y_test, cv=5)
LogReg_mean_final, LogReg_std_final = np.mean(log_ridge_scores), np.std(log_ridge_scores)/np.sqrt(5)
LDA_test_score, LogRidge_test_score = 0.8491118842702802, 0.8659586156381615
This final model is an ensemble of the optimized models presented previously. We include all four: random forest, gradient boosting, linear discriminant analysis, and logistic regression, each with the parameters obtained from optimization. As the final estimator, we use logistic regression. The model performs well but was not able to outperform all other methods on the held-out test data set. We also uploaded a prediction based on the Kaggle test data set, which performed slightly worse than the optimized random forest model.
# Define estimators used for stacking
estimators = [
('rfc', RandomForestClassifier(max_depth=None, max_features=18, min_samples_leaf = 1,
n_estimators=250, random_state=12, n_jobs=-1)),
("lda", make_pipeline(StandardScaler(),
LinearDiscriminantAnalysis(solver="lsqr"))),
('LogRidge', make_pipeline(StandardScaler(),
LogisticRegression(penalty="l2", solver="saga",
C=0.07390722033525783, random_state=12,
tol=1e-2))),
('gbc', GradientBoostingClassifier(learning_rate=0.1, n_estimators=330, max_depth=3,
min_samples_split=2, min_samples_leaf=1, max_features=None, random_state=12))]
# Define model ensemble
StackingClas = StackingClassifier(estimators=estimators, stack_method="predict_proba",
final_estimator=LogisticRegression(random_state=12),
n_jobs=-1)
# Train model
StackingClas.fit(X_train, y_train)
# Evaluate model on separated test data set
stacking_test_score = accuracy_score(y_true=y_test, y_pred=StackingClas.predict(X_test))
print("Accuracy score on separated test data set", stacking_test_score)
Accuracy score on separated test data set 0.8723676982237686
Here, we compare the different models developed in the previous sections to determine the final model for submission.
Since we used cross-validation for the single models, the cross-validation mean score is comparable across all models. Results are summarized in the table below.
# Accuracy score and SD for 10% of the data
model_accuracy_df = pd.DataFrame({"Accuracy": [rfc_meanCV, GBC_mean_final, LDA_mean_final, LogReg_mean_final],
"SD": [rfc_stdCV, GBC_std_final, LDA_std_final, LogReg_std_final]},
index=["RFC", "GBC", "LDA","LogRidge"])
model_accuracy_df
| Accuracy | SD | |
|---|---|---|
| RFC | 0.860538 | 0.001278 |
| GBC | 0.856434 | 0.004139 |
| LDA | 0.845998 | 0.003905 |
| LogRidge | 0.849294 | 0.004188 |
We initially sliced 10% of the training data for the evaluation of the stacked model. Predictions of all models in the previous section were conducted on this test data set; scores can be found in the following table.
# Test scores from tested models
test_score_df = pd.DataFrame({"Score": [rfc_test_score, GBC_test_score, LDA_test_score, LogRidge_test_score, stacking_test_score]},
index=["RFC", "GBC", "LDA", "LogRidge", "Stacking"])
test_score_df
| Score | |
|---|---|
| RFC | 0.869987 |
| GBC | 0.875114 |
| LDA | 0.849112 |
| LogRidge | 0.865959 |
| Stacking | 0.872368 |
Preliminary conclusion:
The random forest classifier has the best accuracy and standard deviation scores under cross-validation; however, on the held-out test data set, GBC and stacking perform better.
To further support our decision in determining the best model, we created confusion matrices for our top models, the random forest classifier and the gradient boosting classifier.
The matrices below show that the classification accuracy of the random forest was notably higher, at ~99.2%, compared to the gradient booster's ~87.5%. Type 1 and Type 2 errors also occurred less often with the RFC than the GBC. Sensitivity and specificity of the RFC were also superior, at ~99.2% and ~98.9% respectively, while the GBC's sensitivity was ~90.7% and specificity ~80.3%.
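As a reminder of how these metrics derive from a confusion matrix, the sketch below uses toy labels (standing in for `y_test` and model predictions, not our actual results). We alias scikit-learn's `confusion_matrix` since the notebook already imports yellowbrick's function of the same name:

```python
import numpy as np
from sklearn.metrics import confusion_matrix as sk_confusion_matrix

# Hypothetical labels and predictions for illustration only
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 1, 0, 1, 1, 0])

# sklearn returns [[TN, FP], [FN, TP]] for binary labels
tn, fp, fn, tp = sk_confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)  # true positive rate (recall)
specificity = tn / (tn + fp)  # true negative rate
print(sensitivity, specificity)  # 0.8 0.8 for these toy labels
```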
# Random Forest Classifier
confusion_matrix(rfc_final, X_train, y_train, X_test, y_test, classes=["0", "1"], percent=False);
# Gradient Boosting Classifier
confusion_matrix(GBC_final, X_train, y_train, X_test, y_test, classes=["0", "1"], percent=False);
Based on the accuracy and standard deviation scores, as well as the confusion matrices, we conclude that the random forest classifier is our best predictive model.
While comparing testing scores across the various models did demonstrate that the gradient boosting classifier slightly outperforms other methods, the RFC testing score is nearly equal to GBC. Apart from the model evaluation conducted above, Kaggle submissions also showed that the optimized RFC outperforms the optimized GBC, with scores of 0.86098 and 0.85604 respectively. The stacked model also performed worse than the single optimized RFC model, which is why we decided not to explore this approach in more detail.
While we are not surprised by these results, since RFCs are robust models (which often require little hyperparameter tuning and are less prone to overfitting), we recognize that different data cleaning and NA imputation approaches impact model performance, however slightly. Other imputation methods (summarized in the Supplementary Notebooks) or further summarization of categorical variables might have yielded a different final model had we investigated these possibilities further. Alternatively, other methods of hyperparameter tuning could have been applied to the gradient boosting model, which is at times known to outperform random forests.
Nevertheless, due to limitations in time and computational power of our personal computers, we chose to move forward with the RFC as our final model.
Previously, we utilized 10% of the training data during optimization of the RFC for faster computation. Since we have determined that the RFC outperforms the other classifiers, we used the whole training dataset for the final parameter optimization. Of note, grid parameters were not modified from the initial RFC optimization. The steps for the final model fitting can be reviewed below.
# Reload complete train data set
train_df = train_df_original.copy()
# Create X and y
X = train_df.drop("high_income", axis=1)
y = train_df["high_income"]
# Final Random Forest Classifier Optimization
start_time = timeit.default_timer()
# Number of trees in random forest
n_estimators = [100, 250]
# Number of features to consider at every split
max_features = ['sqrt', 16, 18, 20, 22, 24, 26, 28, 30]
# number of samples required at each leaf node
min_samples_leaf = [1, 3, 5]
# max_depth
max_depth = [20, 35, None]
# Create the random grid
p_grid = {'n_estimators': n_estimators,
'max_features': max_features,
'min_samples_leaf': min_samples_leaf,
'max_depth': max_depth}
# Create the Classifier
rfc_final_op = RandomForestClassifier(random_state=12)
# Instantiate the grid search model
folds=StratifiedKFold(n_splits=5, shuffle=True, random_state=12)
grid_cv_final=GridSearchCV(estimator=rfc_final_op, param_grid=p_grid, scoring="accuracy", cv=folds, n_jobs=-1)
# Fit the grid search to the data
grid_cv_final.fit(X, y)
best_param = grid_cv_final.best_params_
best_score = grid_cv_final.best_score_
print("Best parameters:", best_param)
print("Best score:", best_score)
elapsed = timeit.default_timer() - start_time
elapsed_min = elapsed/60
print("Time [min]:", elapsed_min)
Best parameters: {'max_depth': None, 'max_features': 28, 'min_samples_leaf': 3, 'n_estimators': 250}
Best score: 0.8616844424175281
Time [min]: 138.59173749463335
# Define Model
#RandomForestClassifier(max_depth=None , max_features=28, min_samples_leaf=3, n_estimators=250 , random_state=12, n_jobs=-1)
rfc_kaggle = grid_cv_final.best_estimator_
rfc_kaggle.fit(X,y)
# Random Forest Classifier Prediction on Kaggle Test Data
y_pred = rfc_kaggle.predict(X_kaggle)
# Create csv
index = np.arange(1, len(y_pred) + 1)
final_output = pd.DataFrame({'id': index, 'high_income': y_pred})
final_output["id"] = final_output["id"].astype('int64')
final_output["high_income"] = final_output["high_income"].astype('int64')
# Write without the DataFrame index to match the submission format
final_output.to_csv('final_output.csv', index=False, header=["id", "high_income"])
We finished in 10th position with an $accuracy = 0.85687$.